1. Type of Application

This new application is for a(n) (check one applicable item below): 
☒ Original (nonprovisional)

□ Design 

□ Plant 

WARNING: Do not use this transmittal for a completion in the U.S. of an International Application under 35 U.S.C.
371(c)(4) unless the International Application is being filed as a divisional, continuation or continuation-in-part
application.

WARNING: Do not use this transmittal for the filing of a provisional application. 
I hereby certify that this New Application Transmittal and the documents referred to as enclosed therein are being deposited
Addressee" Mailing Label Number EL699731101US addressed to the: Assistant Commissioner of Patents, Washington,
D.C. 20231
EXPRESS MAIL LABEL
NO.: EL699731101US



2. Benefit of Prior U.S. Application(s) (35 U.S.C. 119(e), 120, or 121) 



NOTE: If the new application being transmitted is a divisional, continuation or a continuation-in-part of a parent case, or
where the parent case is an International Application which designated the U.S., or benefit of a prior provisional
application is claimed, then check the following item and complete and attach ADDED PAGES FOR NEW
APPLICATION TRANSMITTAL WHERE BENEFIT OF PRIOR U.S. APPLICATION(S) CLAIMED.



If an application claims the benefit of the filing date of an earlier filed application under 35 U.S.C. 120, 121
or 365(c), the 20-year term of that application will be based upon the filing date of the earliest U.S.
application that the application makes reference to under 35 U.S.C. 120, 121 or 365(c). (35 U.S.C.
154(a)(2) does not take into account, for the determination of the patent term, any application on which
priority is claimed under 35 U.S.C. 119, 365(a) or 365(b).) For a c-i-p application, applicant should review
whether any claim in the patent that will issue is supported by an earlier application and, if not, the
applicant should consider canceling the reference to the earlier filed application. The term of a patent is not
based on a claim-by-claim approach. See Notice of April 14, 1995, 60 Fed. Reg. 20,195, at 20,205.

When the last day of pendency of a provisional application falls on a Saturday, Sunday, or Federal holiday 
within the District of Columbia, any nonprovisional application claiming benefit of the provisional must be 
filed prior to the Saturday, Sunday or Federal holiday within the District of Columbia. See 37 C.F.R. § 1.78(a)(3).



□ The new application being transmitted claims the benefit of prior U.S. application(s) 
and enclosed are ADDED PAGES FOR NEW APPLICATION TRANSMITTAL WHERE 
BENEFIT OF PRIOR U.S. APPLICATION(S) CLAIMED. 



NOTE: If one of the following 3 items applies, then complete and attach ADDED PAGES FOR NEW APPLICATION
TRANSMITTAL WHERE BENEFIT OF A PRIOR U.S. APPLICATION CLAIMED and a NOTIFICATION IN PARENT
APPLICATION OF THE FILING OF THIS CONTINUATION APPLICATION.

□ Divisional. 

□ Continuation. 

□ Continuation-in-Part (C-I-P). 

3. Papers Enclosed That Are Required For Filing Date Under 37 CFR 1.53 (Regular) or 37 CFR
1.153 (Design) Application

24 Pages of specification 

6 Pages of claims 

1 Pages of Abstract 

_ Sheets of drawing 

□ formal 

□ informal 

WARNING: DO NOT submit original drawings. A high quality copy of the drawings should be supplied when filing a 
patent application. The drawings that are submitted to the Office must be on strong, white, smooth, and 
non-shiny paper and meet the standards according to § 1.84. If corrections to the drawings are necessary, 
they should be made to the original drawing and a high-quality copy of the corrected original drawing then 
submitted to the Office. Only one copy is required or desired. Comments on proposed new 37 CFR 1.84. 
Notice of March 9, 1988 (1990 O.G. 57-62). 



NOTE: "Identifying indicia, if provided, should include the application number or the title of the invention, inventor's name, 
docket number (if any), and the name and telephone number of a person to call if the Office is unable to match 
the drawings to the proper application. This information should be placed on the back of each sheet of drawing 
a minimum distance of 1.5 cm. (5/8 inch) down from the top of the page." 37 C.F.R. 1.84(c).

(complete the following, if applicable) 

□ The enclosed drawing(s) are photograph(s), and there is also attached a "PETITION TO 
ACCEPT PHOTOGRAPH(S) AS DRAWING(S)". 37 C.F.R. 1.84(b). 
4. Additional papers enclosed

□ Preliminary Amendment 

□ Information Disclosure Statement (37 CFR 1.98)

□ Form PTO-1449

□ Citations 

□ Declaration of Biological Deposit 

□ Submission of "Sequence Listing," computer readable copy and/or amendment 
pertaining thereto for biotechnology invention containing nucleotide and/or amino acid 
sequence. 

□ Authorization of Attorney(s) to Accept and Follow Instructions from Representative 

□ Special Comments 

□ Other 

5. Declaration or oath 

□ Enclosed 

executed by (check all applicable boxes) 

□ inventors. 

□ legal representative of inventors. 37 CFR 1.42 or 1.43

□ joint inventor or person showing a proprietary interest on behalf of inventor who
refused to sign or cannot be reached.

□ This is the petition required by 37 CFR 1.47 and the statement required by
37 CFR 1.47 is also attached. See item 13 below for fee.

☒ Not Enclosed.

WARNING: Where the filing is a completion in the U.S. of an International Application but where a declaration is not
available or where the completion of the U.S. application contains subject matter in addition to the
International Application, the application may be treated as a continuation or continuation-in-part, as the case
may be, utilizing ADDED PAGE FOR NEW APPLICATION TRANSMITTAL WHERE BENEFIT OF PRIOR U.S.
APPLICATION CLAIMED.

☒ Application is made by a person authorized under 37 CFR 1.41(c) on behalf of
all the above named inventors. (The declaration or oath, along with the surcharge
required by 37 CFR 1.16(e) can be filed subsequently).

NOTE: It is important that all the correct inventor(s) are named for filing under 37 CFR 1.41(c) and 1.53(b).

□ Showing that the filing is authorized. (Not required unless called into question.
37 CFR 1.41(d).)

6. Inventorship Statement 

WARNING: If the named inventors are each not the inventors of all the claims, an explanation, including the ownership
of the various claims at the time the last claimed invention was made, should be submitted.

The inventorship for all the claims in this application is:

□ The same 

□ Not the same. An explanation, including the ownership of the various claims at the 
time the last claimed invention was made, 

7. Language 

NOTE: An application including a signed oath or declaration may be filed in a language other than English. A verified
English translation of the non-English language application and the processing fee of $130.00 required by 37 CFR
1.17(k) is required to be filed with the application or within such time as may be set by the Office. 37 CFR
1.52(d).

NOTE: A non-English oath or declaration in the form provided or approved by the PTO need not be translated. 37 CFR 
1.69(b). 

☒ English

□ non-English 

□ the attached translation is a verified translation. 37 CFR 1.52(d). 

8. Assignment 

☒ An assignment of the invention to RAMOT UNIVERSITY AUTHORITY FOR APPLIED
RESEARCH &

□ is attached. A separate □ "COVER SHEET FOR ASSIGNMENT (DOCUMENT)
ACCOMPANYING NEW PATENT APPLICATION" or □ FORM PTO 1595 is also
attached.

☒ will follow.

NOTE: "If an assignment is submitted with a new application, send two separate letters -- one for the application and one
for the assignment." Notice of May 4, 1990 (1114 O.G. 77-78).

WARNING: A newly executed "CERTIFICATE UNDER 37 CFR 3.73(b)" must be filed when a continuation-in-part
application is filed by an assignee. Notice of April 30, 1993, 1150 O.G. 62-64.

9. Certified Copy 

Certified copy of application 

Country Appln. No. Filed 

from which priority is claimed 

□ is attached. 

□ will follow. 

NOTE: The foreign application forming the basis for the claim for priority must be referred to in the oath or declaration. 
37 CFR 1.55(a) and 1.63. 

NOTE: This item is for any foreign priority for which the application being filed directly relates. If any parent U.S.
application or International Application from which this application claims benefit under 35 U.S.C. 120 is itself
entitled to priority from a prior foreign application then complete item 18 on the ADDED PAGES FOR NEW
APPLICATION TRANSMITTAL WHERE BENEFIT OF PRIOR U.S. APPLICATION(S) CLAIMED.

10. Fee Calculation (37 CFR 1.16) 

A. ☒ Regular Application

Claims as Filed

                                        Number Filed        Number Extra   Rate         Basic Fee (37 CFR 1.16(a))
                                                                                        $690.00

Total Claims (37 CFR 1.16(c))           32 - 20 =           12             x $18.00     216.00

Independent Claims (37 CFR 1.16(b))     [illegible] - 3 =   1              x $

Multiple dependent claim(s), if any
(37 CFR 1.16(d))                                                           + $260.00


□ Amendment cancelling extra claims enclosed. 

□ Amendment deleting multiple-dependencies enclosed. 

□ Fee for extra claims is not being paid at this time. 

If the fees for extra claims are not paid on filing they must be paid or the claims cancelled by amendment,
prior to the expiration of the time period set for response by the Patent and Trademark Office
in any notice of fee deficiency. 37 CFR 1.16(d).



□ Design application 

($310.00 - 37 CFR 1.16(f)) 



□ Plant application 

($480.00 - 37 CFR 1.16(g)) 



Filing Fee Calculation $ 



Filing Fee Calculation $ 



Filing Fee Calculation 



11. Small Entity Statement(s)

□ Verified Statement(s) that this is a filing by a small entity
under 37 CFR 1.9 and 1.27 is(are) attached or has been
filed.

Filing Fee Calculation (50% of A, B or C above) $

NOTE: Any excess of the full fee paid will be refunded if a verified statement and a refund request are filed
within 2 months of the date of timely payment of a full fee. 37 CFR 1.28(a).

12. Request for International-Type Search (37 CFR 1.104(d)) (Complete, if applicable)

□ Please prepare an international-type search report for this application at the 
time when national examination on the merits takes place. 

13. Fee Payment Being Made At This Time 

☒ Not Enclosed

☒ No filing fee is to be paid at this time. (This and the surcharge required
by 37 CFR 1.16(e) can be paid subsequently.)



Enclosed 

□ basic filing fee 
□ Recording assignment

($40.00; 37 CFR 1.21(h)) (See attached "COVER 
SHEET FOR ASSIGNMENT ACCOMPANYING NEW 
APPLICATION.") 

□ Petition fee for filing by other than all the inventors 
or person on behalf of the inventor where inventor 
refused to sign or cannot be reached. 

($130.00; 37 CFR 1.47 and 1.17(h)) $ 

□ For processing an application with a specification in 
a non-English language. 

($130.00; 37 CFR 1.52(d) and 1.17(k)) $ 

□ Processing and retention fee
($130.00; 37 CFR 1.53(d) and 1.21(l))

□ Fee for international-type search report

($40.00; 37 CFR 1.21(e)). $

NOTE: 37 CFR 1.21(l) establishes a fee for processing and retaining any application which is abandoned for
failing to complete the application pursuant to 37 CFR 1.53(d) and this, as well as the changes to 37
CFR 1.53 and 1.78, indicate that in order to obtain the benefit of a prior U.S. application, either the
basic filing fee must be paid or the processing and retention fee of § 1.21(l) must be paid within 1
year from notification under 1.53(d).

Total fees enclosed $ 

14. Method of Payment of Fees 

□ Check in the amount of $ 

□ Charge Account No. 12-0425 in the amount of $ 

A duplicate of this transmittal is attached. 

NOTE: Fees should be itemized in such a manner that it is clear for which purpose the fees are paid. 37 CFR 
1.22(b). 

15. Authorization to Charge Additional Fees 

WARNING: If no fees are to be paid on filing, the following items should not be completed. 

WARNING: Accurately count claims, especially multiple dependent claims, to avoid unexpected high charges, if extra 
claim charges are authorized. 

□ The Commissioner is hereby authorized to charge the following additional fees by this 
paper and during the entire pendency of this application to Account No. 12-0425. 

□ 37 CFR 1.16(a), (f) or (g) (filing fees)

□ 37 CFR 1.16(b), (c) and (d) (presentation of extra claims) 

NOTE: Because additional fees for excess or multiple dependent claims not paid on filing or on later presentation must
only be paid or these claims cancelled by amendment prior to the expiration of the time period set for response
by the PTO in any notice of fee deficiency (37 CFR 1.16(d)), it might be best not to authorize the PTO to charge
additional claim fees, except possibly when dealing with amendments after final action.

□ 37 CFR 1.16(e) (surcharge for filing the basic filing fee and/or declaration on a date
later than the filing date of the application)

□ 37 CFR 1.17 (application processing fees) 

WARNING: While 37 CFR 1.17(a), (b), (c) and (d) deal with extensions of time under § 1.136(a), this authorization
should be made only with the knowledge that: "Submission of the appropriate extension fee under 37 C.F.R.
1.136(a) is to no avail unless a request or petition for extension is filed." (Emphasis added). Notice of
November 5, 1985 (1060 O.G. 27)

□ 37 CFR 1.18 (issue fee at or before mailing of Notice of Allowance, pursuant to 37
CFR 1.311(b))

NOTE: Where an authorization to charge the issue fee to a deposit account has been filed before the mailing of a Notice 
of Allowance, the issue fee will be automatically charged to the deposit account at the time of mailing the notice 
of allowance. 37 CFR 1.311(b). 

NOTE: 37 CFR 1.28(b) requires "Notification of any change in loss of entitlement to small entity status must be filed in
the application ... prior to paying, or at the time of paying, ... issue fee". From the wording of 37 CFR 1.28(b): 
(a) notification of change of status must be made even if the fee is paid as "other than a small entity" and (b) no 
notification is required if the change is to another small entity. 

16. Instructions As To Overpayment 



□ credit Account No. 12-0425

□ refund 




Reg. No. 25,858 William R. Evans 

Ladas & Parry 

Tel. No. (212) 708-1930 26 West 61 Street 

New York, NY 10023 



□ Incorporation by reference of added pages 

(Check the following item if the application in this transmittal claims the benefit
of prior U.S. application(s) (including an international application entering the U.S.
stage as a continuation, divisional or C-I-P application) and complete and attach
the ADDED PAGES FOR NEW APPLICATION TRANSMITTAL WHERE BENEFIT OF
PRIOR U.S. APPLICATION(S) CLAIMED)

□ Plus Added Pages for New Application Transmittal Where Benefit of Prior U.S.
Application(s) Claimed

Number of pages added 

□ Plus Added Pages for Papers Referred to in Item 4 Above 

Number of pages added 

□ Plus "Assignment Cover Letter Accompanying New Application" 

Number of pages added 



☒ Statement Where No Further Pages Added

(If no further pages form a part of this Transmittal, then end this Transmittal with this
page and check the following item:)

☒ This transmittal ends with this page.
PATENT

IN THE UNITED STATES PATENT AND TRADEMARK OFFICE 

In re application: ITZHAK PEER, et al 

For: METHOD FOR SEQUENCING POLYNUCLEOTIDES 

Attorney Docket No.: U 012911-3 

Assistant Commissioner for Patents 
Washington, D.C. 20231 

Sir: 

PRELIMINARY AMENDMENT 

Please amend the above application as follows: 
IN THE CLAIMS 

Claim 3, line 1, delete "or 2" 

Claim 6, line 1, delete "any one of the previous claims" and replace therefor -- claim 1 --

Claim 7, line 1, delete "any one of Claims 1 to 6" and replace therefor -- claim 



CERTIFICATE UNDER 37 CFR 1.10

I hereby certify that this paper is being deposited with the United States Postal Service on this
date AUGUST 22, 2000 in an envelope as "EXPRESS MAIL POST OFFICE TO ADDRESSEE"
Mailing Label Number EL699731101US addressed to the: Commissioner of Patents and
Trademarks, Washington, D.C. 20231

JENNIFER RASHKIN

(Type or print name of person mailing paper)

(Signature of person mailing paper)

NOTE: Each paper or fee referred to as enclosed herein has the number of the "Express Mail"
mailing label placed thereon prior to mailing. 37 CFR 1.10(b).

EXPRESS MAIL LABEL
NO.: EL699731101US



Claim 8, line 1, delete "any one of the previous claims" and replace therefor -- claim 1 --

Claim 13, line 1, delete "any one of Claims 1 to 7" and replace therefor -- claim 1 --

Claim 15, line 1, delete "any one of Claims 1 to 7" and replace therefor -- claim 1 --

Claim 19, line 1, delete "any one of Claims 1 to 11" and replace therefor -- claim 1 --

Claim 20, line 1, delete "any of Claims 13, 14, or 19" and replace therefor -- claim 13 --

Claim 21, line 1, delete "any of Claims 15, 17, or 18" and replace therefor -- claim 15 --

Claim 22, line 1, delete "any one of the previous claims" and replace therefor -- claim 1 --

Claim 23, line 1, delete "any one of Claims 1 to 22" and replace therefor -- claim 1 --

Claim 24, line 1, delete "any one of the previous claims" and replace therefor -- claim 1 --

Claim 25, line 1, delete "any one of Claims 1 to 5, 8 to 24" and replace therefor -- claim 1 --

Claim 26, line 1, delete "any one of the previous claims" and replace therefor -- claim 1 --

Claim 27, line 1, delete "any one of the previous claims" and replace therefor -- claim 1 --

Claim 28, line 1, delete "any one of the previous claims" and replace therefor -- claim 1 --

Claim 29, line 1, delete "any one of the previous claims" and replace therefor -- claim 1 --

Claim 30, line 1, delete "any one of the previous claims" and replace therefor -- claim 1 --




WILLIAM R. EVANS
LADAS & PARRY
26 WEST 61st STREET
NEW YORK, NEW YORK 10023
REG. NO. 25,858 (212) 708-1930
METHOD FOR SEQUENCING POLYNUCLEOTIDES



FIELD OF THE INVENTION 

This invention relates to computational methods in molecular biology, and 
more specifically to methods for determining the sequence of a polynucleotide. 
BACKGROUND OF THE INVENTION

Sequencing by hybridization (SBH) is a method for sequencing a
polynucleotide such as a DNA molecule (Bains and Smith 1988, Lysov et al. 1988,
Southern 1988, Drmanac and Crkvenjakov 1987, Macevicz 1989). In this
method, a chip, or microarray, is used consisting of a surface upon which all
possible oligonucleotide probes of a particular length k (referred to herein as
"k-mers") are immobilized (Southern 1996). The DNA molecule whose sequence
is to be determined, referred to as the "target molecule", is allowed to hybridize
to the k-mers on the chip. The target molecule and the k-mers on the chip may all
be single stranded molecules. Alternatively, a double stranded target may first be
cut into fragments having single stranded "sticky ends", and the k-mers on the
chip may be the sticky ends of double stranded molecules. Ideally, a single
stranded target or the sticky end of a double stranded target hybridizes to a k-mer
on the chip if and only if the sequence complementary to the k-mer occurs
somewhere in the target sequence or the sticky end. Thus, in principle, it is
possible to experimentally determine the "k-spectrum" of the target (the set of all
k-long substrings present in the target). In practice, however, the data are
ambiguous due to the ability of the target to bind to k-mers that are only partially
complementary to one of its substrings. Thus, any binarization of the
hybridization signal will contain errors.

The goal of SBH is to determine the target sequence from the target
spectrum. However, even if the target spectrum were error free, the target
sequence is not uniquely determined by the spectrum. If the number of sequences
consistent with the spectrum is large, there is no satisfactory method to select the
true sequence. Theoretical analysis and simulations (Southern et al. 1992,
Pevzner and Lipshutz 1994) have shown that even when the spectrum is errorless
and the correct multiplicity of each k-mer in the target sequence is known, the
average length of a uniquely reconstructible target sequence using a chip of
8-mers is only about two hundred nucleotides, far below the length of a DNA
molecule that may be sequenced by electrophoresis.
Let $\Sigma = \{A, C, G, T\}$ designate the set of nucleotides composing a DNA
molecule. $M = 4$ is the "alphabet size". A DNA sequence is a string over $\Sigma$, which
is denoted herein between angle brackets ($\langle\,\rangle$). The k-spectrum of a target sequence $T$ of
length $L$, $T = \langle t_1, t_2, \ldots, t_L \rangle$, is the set of all k-long substrings (k-mers) of $T$. For
each k-mer $\vec{x} = \langle x_1, x_2, \ldots, x_k \rangle \in \Sigma^k$, we define $T(\vec{x})$ to be 1 if $\vec{x}$ is a substring
of $T$, and 0 otherwise. We denote by $K = M^k$ the number of k-mers. A hybridization
experiment measures, for each k-mer $\vec{x} \in \Sigma^k$, an intensity of its hybridization
with the target.
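The definitions above can be illustrated with a minimal Python sketch. The function names `k_spectrum` and `indicator` and the example string are hypothetical and chosen only for illustration; sequences are represented as ordinary strings over A, C, G, T.

```python
from itertools import product

ALPHABET = "ACGT"  # the four nucleotides, so the alphabet size M is 4


def k_spectrum(target: str, k: int) -> set[str]:
    """Return the set of all k-long substrings (k-mers) of target."""
    return {target[i:i + k] for i in range(len(target) - k + 1)}


def indicator(target: str, k: int) -> dict[str, int]:
    """Map each of the K = M**k possible k-mers x to T(x): 1 if x occurs in target, else 0."""
    spectrum = k_spectrum(target, k)
    result = {}
    for letters in product(ALPHABET, repeat=k):
        x = "".join(letters)
        result[x] = int(x in spectrum)
    return result


# Toy example (hypothetical target): the 3-mer ACG occurs twice but appears once in the set.
spectrum = k_spectrum("ACGTACG", 3)
```

Because the spectrum is a set, a k-mer that recurs in the target is recorded only once, which is why multiplicities are not available from the spectrum alone.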

The result of an SBH experiment may be described by a graph in which
each candidate target sequence is represented as a path (Pevzner et al.
1989). The graph is a directed de Bruijn graph $G(V, E)$ whose vertices are labeled
by all the (k-1)-mers (the set of vertices $V = \Sigma^{k-1}$), and whose edges are labeled by
k-mers (the set of edges $E = \Sigma^k$). The edge labeled $\langle x_1, x_2, \ldots, x_k \rangle$ connects the
vertex $\langle x_1, x_2, \ldots, x_{k-1} \rangle$ to the vertex $\langle x_2, \ldots, x_k \rangle$. There is a 1-1 correspondence
between L-long candidate target sequences and (L - k + 1)-long paths in $G$,
whose edge labels comprise the target spectrum. Hereafter, we interchangeably
refer to edges and their labels, and also to sequences and their corresponding
paths.
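The correspondence between sequences and paths can be sketched as follows. This is an illustrative sketch only; the helper names `sequence_to_path` and `path_to_sequence` are hypothetical, and each edge is represented as a (from-vertex, edge label, to-vertex) triple.

```python
def sequence_to_path(seq: str, k: int) -> list[tuple[str, str, str]]:
    """Return the path of seq in the de Bruijn graph as (from_vertex, edge_label, to_vertex) triples."""
    path = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # The edge labeled by the k-mer joins its (k-1)-long prefix to its (k-1)-long suffix.
        path.append((kmer[:-1], kmer, kmer[1:]))
    return path


def path_to_sequence(path: list[tuple[str, str, str]]) -> str:
    """Rebuild the sequence: the first edge label, then the last letter of every later edge."""
    first_label = path[0][1]
    return first_label + "".join(label[-1] for _, label, _ in path[1:])


# Toy example: a 6-long sequence with k = 3 gives a path of 6 - 3 + 1 = 4 edges.
example_path = sequence_to_path("ACGTAC", 3)
```

Consecutive edges share a vertex (the suffix of one k-mer is the prefix of the next), so the original sequence is recovered from the path without loss.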

Since k-mers may reoccur in the target sequence, the paths do not have to
be simple. When the spectrum is perfect and the multiplicities of the k-mers in
the spectrum are known, every solution is an Eulerian path (Pevzner et al. 1989).
In practice, however, the spectrum is not perfect and the multiplicities are not
known.

Alternative chip designs (Bains and Smith 1988, Khrapko et al. 1989,
Pevzner et al. 1991, Preparata et al. 1999, Ben-Dor et al. 1999), as well as
interactive protocols (Skiena and Sundaram 1995) have been suggested, often
assuming additional information, in order to reduce the ambiguity of the
hybridization-based reconstruction.

Nucleotide sequences from different sources may resemble each other, due
to a common ancestral gene. This phenomenon is encountered within a species,
between duplicated regions within a genome, and between individuals within a
population. Small differences in sequences, referred to as "Single Nucleotide
Polymorphisms" or SNPs, efficiently serve as genetic markers that are useful in
medicine. Thus the detection and genotyping of SNPs has become an important
task of human geneticists. The evolution of homologous sequences from a
common ancestral gene is mainly due to nucleotide substitution. Insertions and
deletions of nucleotides are also known to have occurred during evolution of
homologous sequences, though at lower rates.

A DNA molecule having a known sequence and known to be homologous
to a target molecule has not yet been used to reduce the ambiguity of SBH data in
order to determine the target sequence.

SUMMARY OF THE INVENTION 

In the following description and set of claims, two parameters are
considered to be equivalent to each other if they are proportional to each other.

The present invention provides a method for sequencing a target sequence.
In accordance with the invention, experimental spectrum data obtained from a
DNA chip is combined with sequence information of a reference DNA molecule.
The reference molecule is preferably a molecule believed to be homologous with
the target. For example, the target sequence may be a mutant gene and the
reference sequence the previously sequenced normal gene. As another example,
the target sequence may be a human gene and the reference sequence the
homologous gene in another organism. A score is defined for each sequence in a
set of candidate target sequences based upon a simultaneous comparison of the
candidate sequence with the spectrum and with the reference sequence. A
candidate target sequence is then selected having an essentially maximal score.
Calculating the score does not require knowledge of the multiplicities of the
k-mers in the k-spectrum. Moreover, unlike all prior art algorithms, the score
does not assume that the spectrum is perfect.




The invention therefore provides a novel probabilistic method that handles
imperfect hybridization data with unknown multiplicities. Thus, in accordance
with the invention, the hybridization of the target $T$ with the k-mer on the DNA
chip complementary to $\vec{x}$ is described by probabilities $P_0(\vec{x})$ and $P_1(\vec{x})$ of the
observed hybridization signal when $T(\vec{x}) = 0$ and $T(\vec{x}) = 1$, respectively. The
results of a hybridization experiment are described by the "probabilistic
spectrum" (PS), defined as the pair $(P_0, P_1)$ of functions $P_i: \Sigma^k \to [0, 1]$. If the
experiment were perfect, i.e., if $P_0(\vec{x})$ and $P_1(\vec{x})$ were each either 0 or 1 with
$P_0(\vec{x}) + P_1(\vec{x}) = 1$, then the PS would represent the k-spectrum. In practice,
however, $P_0(\vec{x})$ and $P_1(\vec{x})$ are both positive. There is thus a chance $1 - P_0(\vec{x})$ for a
false positive (a k-mer $\vec{x}$ not occurring in $T$, whose complementary sequence
produces a hybridization signal indicative of hybridization) and a chance $1 - P_1(\vec{x})$
for a false negative (a k-mer $\vec{x}$ occurring in $T$, whose complementary sequence
produces a signal indicative of no hybridization). (When handling probabilities,
some of which are perfect, problems of division by zero might occur. This is
avoided by implicitly perturbing probabilities 0 and 1 to $\varepsilon$ and $1 - \varepsilon$.)

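The probability of obtaining a specific spectrum PS when $T$ is used as the
target is referred to as the "experimental likelihood". The experimental likelihood
is calculated assuming that the hybridization results of the target to different
k-mer probes are mutually independent. In one embodiment of the invention, an
experimental likelihood $L^e(\hat{T})$ is used that does not assume knowledge of the
multiplicities of each k-mer in the sequence. $L^e(\hat{T})$ is given by:

$$L^e(\hat{T}) = \mathrm{Prob}(PS \mid \hat{T}) = \prod_{\vec{x} \in \Sigma^k} P_{\hat{T}(\vec{x})}(\vec{x}) \qquad (1)$$

Taking logarithms and defining $\omega(\vec{x}) = \log \frac{P_1(\vec{x})}{P_0(\vec{x})}$, we can write:

$$\log P_{\hat{T}(\vec{x})}(\vec{x}) = \begin{cases} \log P_0(\vec{x}) & \text{if } \hat{T}(\vec{x}) = 0 \\ \log P_0(\vec{x}) + \omega(\vec{x}) & \text{if } \hat{T}(\vec{x}) = 1 \end{cases} \qquad (2a)$$

Hence,

$$\log L^e(\hat{T}) = \sum_{\vec{x} \in \Sigma^k} \log P_0(\vec{x}) + \sum_{\vec{x}:\, \hat{T}(\vec{x}) = 1} \omega(\vec{x}) \qquad (2b)$$

The first term is a constant (independent of $\hat{T}$), and is omitted hereafter.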
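As a hedged illustration of this log-likelihood, the sketch below sums the log-odds weights log(P1(x)/P0(x)) over the distinct k-mers of a candidate, with the constant term dropped. The function names, the dictionary layout of the probabilistic spectrum (k-mer mapped to a (P0, P1) pair), and the value of `eps` are assumptions made for illustration and are not taken from the specification.

```python
import math


def omega(p0: float, p1: float, eps: float = 1e-9) -> float:
    """Log-odds weight log(P1/P0), with probabilities 0 and 1 perturbed to eps and 1 - eps."""
    def clamp(p: float) -> float:
        return min(max(p, eps), 1.0 - eps)
    return math.log(clamp(p1) / clamp(p0))


def log_likelihood(candidate: str, k: int, ps: dict[str, tuple[float, float]]) -> float:
    """Sum omega(x) over the distinct k-mers x occurring in candidate (constant term omitted)."""
    kmers = {candidate[i:i + k] for i in range(len(candidate) - k + 1)}
    return sum(omega(*ps[x]) for x in kmers)
```

Each k-mer contributes once no matter how often it recurs in the candidate, which reflects that this likelihood does not use multiplicities.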

In another embodiment, an approximate likelihood $\tilde{L}^e(\hat{T})$ is used, that is
defined as follows: Let $p = e_0, \ldots, e_{L-k}$ be the path in $G$ corresponding to $\hat{T}$ and
define

$$\log \tilde{L}^e(\hat{T}) = \sum_{i=0}^{L-k} \omega(e_i) \qquad (3)$$

$\tilde{L}^e(\hat{T}) = L^e(\hat{T})$ for a path in which all edges have a multiplicity of 1, and is
otherwise an approximation to $L^e(\hat{T})$. $\tilde{L}^e(\hat{T})$ has the advantage of being easily
computable in a recursive manner:

$$\log \tilde{L}^e(e_0 \ldots e_i) = \log \tilde{L}^e(e_0 \ldots e_{i-1}) + \omega(e_i) \qquad (4)$$
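The recursive form of the approximate likelihood can be sketched as a running sum over the edges of the path, one edge at a time. The function name `approx_log_likelihood` and the representation of the weights as a dictionary from k-mer to a precomputed log-odds value are assumptions for illustration.

```python
def approx_log_likelihood(candidate: str, k: int, omega: dict[str, float]) -> float:
    """Accumulate omega over the path edges e_0 .. e_{L-k}, counting repeated edges each time."""
    total = 0.0
    for i in range(len(candidate) - k + 1):
        edge = candidate[i:i + k]
        # Recursive step: value for the path so far plus the weight of the newly added edge.
        total += omega[edge]
    return total
```

When no k-mer repeats in the candidate, this running sum coincides with the sum over distinct k-mers; when a k-mer repeats, its weight is added once per occurrence, which is where the approximation enters.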

In yet another embodiment, an experimental likelihood $\bar{L}^e(\hat{T})$ is used that takes into
account the multiplicities of edges. In this case, the probabilistic spectrum consists
of probabilities $P_i(\vec{x})$, denoting the probability of the observed hybridization signal
when the multiplicity of $\vec{x}$ in the target is $i$. $\bar{L}^e(\hat{T})$ is defined by:

$$\bar{L}^e(\hat{T}) = \mathrm{Prob}(PS \mid \hat{T}) = \prod_{\vec{x} \in \Sigma^k} P_{\hat{T}(\vec{x})}(\vec{x}) \qquad (4b)$$

where $\hat{T}(\vec{x})$ is the multiplicity of $\vec{x}$ in $\hat{T}$.
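A minimal sketch of the multiplicity-aware likelihood follows, computed in log form. It assumes, purely for illustration, that the probabilistic spectrum is stored as a list of positive probabilities per k-mer indexed by multiplicity, and that the last entry of each list stands for "this multiplicity or more"; neither the storage layout nor that capping rule comes from the specification.

```python
import math
from collections import Counter
from itertools import product


def multiplicity_log_likelihood(candidate: str, k: int,
                                probs: dict[str, list[float]],
                                alphabet: str = "ACGT") -> float:
    """Sum, over every possible k-mer x, the log-probability of the signal given x's multiplicity."""
    counts = Counter(candidate[i:i + k] for i in range(len(candidate) - k + 1))
    total = 0.0
    for letters in product(alphabet, repeat=k):
        x = "".join(letters)
        table = probs[x]
        # Assumption: multiplicities beyond the table are capped at its last entry.
        m = min(counts[x], len(table) - 1)
        total += math.log(table[m])
    return total
```

Unlike the multiplicity-free form, every k-mer contributes a term here, including those absent from the candidate (multiplicity 0).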
Thus in its first aspect, the invention provides a method for obtaining a
candidate nucleotide sequence, the candidate nucleotide sequence being indicative of a
sequence of a target polynucleotide molecule $T$, $T$ producing a hybridization signal
$I(\vec{x})$ upon incubating $T$ with a polynucleotide $\vec{x}$ for each polynucleotide $\vec{x}$ in a set
E of polynucleotides, the method comprising the steps of:

(a) for each polynucleotide $\vec{x}$ in the set E of polynucleotides, obtaining a
probability $P_0(\vec{x})$ of the hybridization signal $I(\vec{x})$ when the sequence $\vec{x}$ is not
complementary to a subsequence of $T$ and a probability $P_1(\vec{x})$ of the hybridization
signal when the sequence $\vec{x}$ is complementary to a subsequence of $T$, so as to
obtain a probabilistic spectrum (PS) of $T$;

(b) assigning a score to each of a plurality of candidate nucleotide
sequences, the score being based upon the probabilistic spectrum and upon at least
one reference nucleotide sequence $H$; and

(c) selecting one or more candidate nucleotide sequences having an
essentially maximal score.

In its second aspect, the invention provides a program storage device
readable by machine, tangibly embodying a program of instructions executable by
the machine to perform method steps for obtaining a candidate nucleotide
sequence, the candidate nucleotide sequence being indicative of a sequence of a
target polynucleotide molecule $T$, $T$ producing a hybridization signal $I(\vec{x})$ upon
incubating $T$ with a polynucleotide $\vec{x}$ for each polynucleotide $\vec{x}$ in a set E of
polynucleotides, the method comprising the steps of:

(a) for each polynucleotide $\vec{x}$ in the set E of polynucleotides, obtaining a
probability $P_0(\vec{x})$ of $I(\vec{x})$ when the sequence $\vec{x}$ is not complementary to a
subsequence of $T$ and a probability $P_1(\vec{x})$ of $I(\vec{x})$ when the sequence $\vec{x}$ is
complementary to a subsequence of $T$, so as to obtain a probabilistic spectrum (PS)
of $T$;

(b) assigning a score to each of a plurality of candidate nucleotide
sequences, the score being based upon the probabilistic spectrum and upon at least
one reference nucleotide sequence $H$; and

(c) selecting a candidate nucleotide sequence having an essentially
maximal score.

In its third aspect, the invention provides a computer program product
comprising a computer useable medium having computer readable program code
embodied therein for obtaining a candidate nucleotide sequence, the candidate
nucleotide sequence being indicative of a sequence of a target polynucleotide
molecule $T$, $T$ producing a hybridization signal $I(\vec{x})$ upon incubating $T$ with a
polynucleotide $\vec{x}$ for each polynucleotide $\vec{x}$ in a set E of polynucleotides, the
computer program product comprising:

(a) for each polynucleotide $\vec{x}$ in the set E of polynucleotides, computer
readable program code for causing the computer to obtain a probability $P_0(\vec{x})$ of
$I(\vec{x})$ when the sequence $\vec{x}$ is not complementary to a subsequence of $T$ and a probability
$P_1(\vec{x})$ of $I(\vec{x})$ when the sequence $\vec{x}$ is complementary to a subsequence of $T$;

(b) computer readable program code for causing the computer to assign a
score to each of a plurality of candidate nucleotide sequences, the score being
based upon the probabilistic spectrum and upon at least one reference nucleotide
sequence $H$; and

(c) computer readable program code for causing the computer to select a
candidate nucleotide sequence having an essentially maximal score.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS 
First Embodiment 

In this embodiment, the unknown target sequence $T = \langle t_1 \ldots t_l \rangle$ has a
known, homologous reference sequence $H = \langle h_1 \ldots h_l \rangle$. H and T are known to
differ from each other by nucleotide substitutions without insertions or deletions
(indels). This would be the case, for instance, when the target T is a mutant
sequence whose wild type sequence H has already been sequenced, and one
expects that nucleotide substitutions are the only cause of variability between H
and T (statistically, substitutions are much more prevalent than indels (Wang et
al. 1998)). A set of position-specific substitution matrices $M^{(j)}$, each of size
$|\Sigma| \times |\Sigma|$, is used, where for each position j along the sequence:

$M^{(j)}[i, i'] = \mathrm{Prob}(t_j = i \mid h_j = i')$    (5)

for nucleotides $i$ and $i' \in \Sigma$.

The matrices $M^{(j)}$ may be the same for all j, or may differ for different
positions j. The matrices $M^{(j)}$ are used to calculate a distribution on the space of
possible target sequences. This "prior distribution for ungapped homology", $D^u$,
is given, for each candidate target sequence $\hat{T}$ by:

$D^u(\hat{T}) = \mathrm{Prob}(\hat{T} \mid H) = \prod_{j=1}^{l} M^{(j)}[\hat{t}_j, h_j]$    (6)



One may recursively compute:

$D^u(\langle t_1 \ldots t_j \rangle) = D^u(\langle t_1 \ldots t_{j-1} \rangle) \cdot M^{(j)}[t_j, h_j]$    (7)

We denote $L^{(j)}[x, y] = \log M^{(j)}[x, y]$.
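
The prior of equations (6) and (7) can be illustrated with a short sketch. It assumes, as in the Example further below, a single substitution matrix with 1-q on the diagonal and q/3 elsewhere; the function names and the toy sequences are hypothetical and serve only as illustration.

```python
import math

NUCLEOTIDES = "ACGT"

def uniform_substitution_matrix(q):
    """M[i][h] = Prob(t_j = i | h_j = h): 1-q on the diagonal, q/3 elsewhere (assumed form)."""
    return {i: {h: (1.0 - q) if i == h else q / 3.0 for h in NUCLEOTIDES}
            for i in NUCLEOTIDES}

def log_prior_ungapped(candidate, reference, matrices):
    """log D^u(candidate): sum over positions j of log M^(j)[t_j, h_j]."""
    if len(candidate) != len(reference):
        raise ValueError("ungapped homology requires sequences of equal length")
    total = 0.0
    for j, (t, h) in enumerate(zip(candidate, reference)):
        total += math.log(matrices[j][t][h])  # one factor of the product in (6)
    return total

# Toy data (hypothetical): the same matrix at every position, q = 0.03.
reference = "ACGTAC"
matrices = [uniform_substitution_matrix(0.03)] * len(reference)
same = log_prior_ungapped("ACGTAC", reference, matrices)
one_substitution = log_prior_ungapped("ACGTAG", reference, matrices)
```

Working in log space turns the product of (6) into a sum, which is the form used by the dynamic programs that follow.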

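The probability of a candidate target sequence $\hat{T}$, given the probabilistic
spectrum PS and the reference sequence H is:

$\mathrm{Prob}(\hat{T} \mid H, PS) = \frac{\mathrm{Prob}(H)}{\mathrm{Prob}(H, PS)} \cdot \mathrm{Prob}(\hat{T} \mid H) \cdot \mathrm{Prob}(PS \mid H, \hat{T})$    (8)

Given $\hat{T}$, the hybridization signal is independent of H:

$\mathrm{Prob}(PS \mid H, \hat{T}) = \mathrm{Prob}(PS \mid \hat{T})$

Thus, omitting the constant $\mathrm{Prob}(H) / \mathrm{Prob}(H, PS)$, we write: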



P rob(t | H,PS)= D u (f ) • If (f ) (9a) 
Prob{f \H, PS) = D u (f ) • V (f ) (9b) 
or Prob(t \H,Ps)=D u {l)-L e (f) (9c) 
Taking logarithms, the following "ungapped scores" of a candidate target 
are obtained: 



Scored (f )= log// (f )+ log (f ) (10a) 
S corei u (f ) = log V (f )+ log (f ) (1 Ob) 

S cores" (f ) = log L e (f )+ log D u (f ) (1 Oc) 



With $Score^u_1$, $Score^u_2$ or $Score^u_3$, the higher the score of a sequence $\hat{T}$,
the more likely it is to be the target sequence. The highest scoring candidate
sequence may be determined by any method known in the art. In the search for 
the highest scoring candidate sequence, complexity is preferably reduced by 
deleting from the graph edges for which L e (f) L e (f) or L e (f) is less than a 
predetermined constant. Isolated vertices corresponding to highly improbable
(k-1)-mers are also preferably deleted from the graph.

For example, using $L^e(\hat{T})$, the search for a high scoring candidate sequence
may be performed by the following algorithm referred to herein as "Algorithm
A". In accordance with Algorithm A, for each vertex $y = (y_1 \ldots y_{k-1}) \in \Sigma^{k-1}$ and
integer $j = k-1, k, k+1, \ldots, l$, let $S^u[y, j]$ be the maximum score of a j-long
sequence ending with y aligned to $(h_1 \ldots h_j)$. Initialize, for each y:

$S^u[y, k-1] = \sum_{j=1}^{k-1} L^{(j)}[y_j, h_j]$    (11)

Loop over $j = k, \ldots, l$, and for each vertex $y = (y_1 \ldots y_{k-1})$ recursively
update:

$S^u[y, j] = L^{(j)}[y_{k-1}, h_j] + \max_{e=(z,y) \in E} \{ S^u[z, j-1] + \omega(e) \}$    (12a)

Finally, return:

$MAXScore^u = \max_{y \in V} S^u[y, l]$    (12b)



A sequence T* attaining the maximal score is found from the matrix $S^u$ as
is known in the art, for example, by saving trace-back pointers:

$P[y, j] = \arg\max_{z:\, e=(z,y) \in E} \{ S^u[z, j-1] + \omega(e) \}$    (13a)

$MAXPtr = \arg\max_{y \in V} S^u[y, l]$    (13b)

The maximum-scoring path in the graph is then followed, by setting
$z^l \leftarrow MAXPtr$, and for all $j = l, l-1, \ldots, k$: $z^{j-1} \leftarrow P[z^j, j]$.
Denote $z^j = \langle z^j_1 z^j_2 \ldots z^j_{k-1} \rangle$.

The final result is the sequence of nucleotides
$\langle z^{k-1}_1, z^{k-1}_2, \ldots, z^{k-1}_{k-1}, z^k_{k-1}, z^{k+1}_{k-1}, \ldots, z^l_{k-1} \rangle$.

The time complexity is O(lK), since the maximization in (12a) and (13a) is a
maximum of only a constant number (four) of terms. Although the complexity is 
exponential in k, it is constant for a given microarray (currently feasible values 
are k = 8 or 9). Moreover, the complexity scales linearly with the size of the 
hybridization experimental results, which are part of the input. 
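
A minimal sketch of a dynamic program of this kind is given below for illustration. It assumes the full de Bruijn graph over (k-1)-mers (no reduction to the effective graph), an edge-weight function `omega` and a log-substitution function `log_sub` supplied by the caller, and toy weights of +2 and -2 that are purely hypothetical. All function and variable names are illustrative and not part of the described implementation.

```python
import math
from itertools import product

SIGMA = "ACGT"

def algorithm_a(reference, k, omega, log_sub):
    """Viterbi-style search over the de Bruijn graph of (k-1)-mers.

    omega(kmer) gives the edge weight w(e); log_sub(j, t, h) gives L^(j)[t, h]
    with j counted from 1. Returns (best score, best candidate sequence).
    """
    l = len(reference)
    vertices = ["".join(p) for p in product(SIGMA, repeat=k - 1)]
    # Initialization: score of the first k-1 nucleotides against h_1 ... h_{k-1}.
    S = {y: sum(log_sub(j + 1, y[j], reference[j]) for j in range(k - 1))
         for y in vertices}
    pointers = []  # pointers[i][y] is the best predecessor of y at position j = k + i
    for j in range(k, l + 1):
        new_S, ptr = {}, {}
        for y in vertices:
            best_z, best = None, -math.inf
            for c in SIGMA:
                z = c + y[:-1]              # predecessor vertex; the edge (z, y) is the k-mer c + y
                value = S[z] + omega(c + y)
                if value > best:
                    best_z, best = z, value
            new_S[y] = log_sub(j, y[-1], reference[j - 1]) + best  # recursion of the form (12a)
            ptr[y] = best_z                                        # trace-back pointer
        S = new_S
        pointers.append(ptr)
    y = max(vertices, key=lambda v: S[v])   # best final vertex
    best_score = S[y]
    tail = []
    for ptr in reversed(pointers):          # follow pointers from position l down to k
        tail.append(y[-1])
        y = ptr[y]
    return best_score, y + "".join(reversed(tail))

# Toy example (hypothetical weights): k-mers of the target score +2, all others -2.
target = "ACGTTGCA"
k = 3
spectrum = {target[i:i + k] for i in range(len(target) - k + 1)}

def omega(kmer):
    return 2.0 if kmer in spectrum else -2.0

def log_sub(j, t, h):
    return math.log(0.97) if t == h else math.log(0.01)

reference = "ACGTAGCA"                      # differs from the target at position 5
score, candidate = algorithm_a(reference, k, omega, log_sub)
```

In this toy setting the recovered candidate is the target rather than the reference, because the spectrum evidence along the path outweighs the cost of one substitution. Each vertex examines only its four predecessors, which is the source of the constant factor in the time bound.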



Space complexity requires a more elaborate analysis. When naively using
this algorithm, it requires O(lK) memory space, which is quite high for current
technology microarrays. We now detail how we can modify the algorithm to
reduce space complexity.

Observe that this algorithm consists of two computations: computing the
optimal score (equations (11), (12a) and (12b)), and reconstructing the optimal
sequence (equations (13a) and (13b)). The first task, of computing the optimal
score alone, is space-efficient: it can be accomplished using space which is linear
in the (effective) size of the hybridization experimental data, that is, O([K])
space.

By following the paradigm of Hirschberg (Hirschberg 1975), for example,
for linear-space pair-wise alignment, a version of the algorithm is obtained which
requires only linear space. The reduced space complexity is traded for time
complexity, which increases by an O(log l) factor.

For each position $j = l, l-1, \ldots, k, k-1$, the score of the entire sequence is
decomposed. The total score is represented as a sum of two expressions: the
contribution of its (j - k + 1)-prefix, which equals the score of this prefix
computed by $S^u$, plus the contribution of the corresponding suffix. Formally, for
each vertex $y = (y_1 \ldots y_{k-1}) \in V$, let $R^u[y, j]$ be the maximum contribution to the
score of a (l - j + k - 1)-long sequence beginning with y aligned to $(h_{j-k+2} \ldots h_l)$.
Initialize, for each y:

$R^u[y, l] = 0$    (14)

Loop over $j = l-1, l-2, \ldots, k-1$, and for each vertex $y = (y_1 \ldots y_{k-1})$
recursively update:

$R^u[y, j] = \max_{e=(y,z) \in E} \{ R^u[z, j+1] + \omega(e) + L^{(j+1)}[z_{k-1}, h_{j+1}] \}$    (15)
Observe that, for all $k - 1 \le j \le l$:

$MAXScore^u = \max_{y \in V} \{ S^u[y, j] + R^u[y, j] \}$    (16)



Equation (16) can be used to decompose the problem into two similar
problems, of half its size. Recursively solving these sub-problems gives a
divide-and-conquer approach for finding the optimal sequence. The linear space
algorithm is therefore as follows:

1. If the length l of the target is smaller than some constant C, for
example, 25 nucleotides:

Solve the problem directly, according to the dynamic
program of Equations (11), (12a), (12b), (13a) and (13b).

Otherwise,

2. Set $m = \frac{l + k - 1}{2}$.

3. For each $j = k-1, \ldots, m$:

Compute $S^u[y, j]$ (following equations (11) and (12a)) for all y,
re-using space.

4. For each $j = l, l-1, \ldots, m$:

Compute $R^u[y, j]$ (following equations (14) and (15)) for all y,
re-using space.

5. Find $y^m = \arg\max_{y} \{ S^u[y, m] + R^u[y, m] \}$,
thereby computing $MAXScore^u$ by (16).

6. Recursively compute:

(a) The optimal sequence aligned to $(h_1 \ldots h_m)$ ending with $y^m$.

(b) The optimal sequence aligned to $(h_m \ldots h_l)$ beginning with $y^m$.



Observe that for each y, j, the values of $S^u[y, j]$ and $R^u[y, j]$ are computed
a total of log l times. Thus the algorithm takes O([K] l log l) time and O([K])
space, using the effective spectrum.
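
The midpoint step of the linear-space variant (steps 3 to 5) may be sketched as follows. This is not the full recursion: it only computes one forward column and one backward column, each in a single reusable dictionary, and picks the vertex through which an optimal path passes at position m. The backward column is assumed to start from zero at j = l, the toy weights are hypothetical, and integer division is assumed for m.

```python
import math
from itertools import product

SIGMA = "ACGT"

def all_vertices(k):
    return ["".join(p) for p in product(SIGMA, repeat=k - 1)]

def forward(reference, k, omega, log_sub, m):
    """Forward scores S[y] at position m, keeping only one column in memory."""
    S = {y: sum(log_sub(j + 1, y[j], reference[j]) for j in range(k - 1))
         for y in all_vertices(k)}
    for j in range(k, m + 1):
        S = {y: log_sub(j, y[-1], reference[j - 1])
                + max(S[c + y[:-1]] + omega(c + y) for c in SIGMA)
             for y in S}
    return S

def backward(reference, k, omega, log_sub, m):
    """Backward scores R[y] at position m, keeping only one column in memory."""
    l = len(reference)
    R = {y: 0.0 for y in all_vertices(k)}   # assumed initialization at j = l
    for j in range(l - 1, m - 1, -1):
        # Successor z = y[1:] + c; the edge (y, z) is the k-mer y + c, aligned to h_{j+1}.
        R = {y: max(R[y[1:] + c] + omega(y + c) + log_sub(j + 1, c, reference[j])
                    for c in SIGMA)
             for y in R}
    return R

def midpoint_vertex(reference, k, omega, log_sub, m):
    """Vertex maximizing S[y] + R[y] at position m, together with that maximum."""
    S = forward(reference, k, omega, log_sub, m)
    R = backward(reference, k, omega, log_sub, m)
    y_m = max(S, key=lambda y: S[y] + R[y])
    return y_m, S[y_m] + R[y_m]

# Toy example (hypothetical weights): k-mers of the target score +2, all others -2.
target = "ACGTTGCA"
k = 3
spectrum = {target[i:i + k] for i in range(len(target) - k + 1)}

def omega(kmer):
    return 2.0 if kmer in spectrum else -2.0

def log_sub(j, t, h):
    return math.log(0.97) if t == h else math.log(0.01)

reference = "ACGTAGCA"
m = (len(reference) + k - 1) // 2
y_m, total = midpoint_vertex(reference, k, omega, log_sub, m)
```

The sum of the forward and backward columns has the same maximum at every position j, which is what allows the problem to be split at m and each half solved recursively.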

Second Embodiment: substitutions and deletions 

In this embodiment, the unknown target sequence $T = (t_1 \ldots t_{l'})$ differs from
the reference $H = (h_1 \ldots h_l)$ by substitutions and deletions only, without insertions.

Denote the probability of initiating a gap right before hj (aligning hj to 
space) is 2°' . Similarly, is the logarithm of the probability for gap extension at 
hj. Also define p, =log(l-2 Pj )d j =log(l-2" J ). To overcome boundary problems 
at the ends of the sequence, we extend the alphabet by including left and right
space characters: $\bar{\Sigma} = \Sigma \cup \{\triangleright, \triangleleft\}$. We augment the reference sequence by the string
$\triangleright^k$ on its left and $\triangleleft^k$ on the right. We extend the substitution matrix by using
probabilities that force alignment of each of $\triangleright$ and $\triangleleft$ to itself. Formally, we
define:



U {xz\x=> J ,zeZ k 



(17) 



We arbitrarily set $\omega(y)$ to 0 for each $y \in \bar{\Sigma}^k \setminus \Sigma^k$. Thus, the weighted
de-Bruijn graph is naturally extended over $\bar{\Sigma}^{k-1}$, and so is [G] = ([V], [E]), its
effective subgraph. Hereafter, we use the notation [G] for the extended graph. As
with the previous embodiment, in order to reduce complexity, edges for which 
L e (f) or L e (f) is less than e are preferably deleted from the graph. Isolated 
vertices corresponding to highly improbable (k-1)-mers are also preferably
deleted from the graph.

The search for a high scoring candidate sequence may be performed by the
following algorithm referred to herein as "Algorithm B". In accordance with
Algorithm B, for each $y = (y_1 \ldots y_{k-1}) \in [V]$ and $j = k-1, k, k+1, \ldots, l$, $S^d[y, j]$ is
defined as the maximum score of aligning a sequence ending with y to $(h_1 \ldots h_j)$
where $h_j$ is aligned to a gap (and $y_{k-1}$ is aligned to some $h_i$ among $h_1 \ldots h_j$). Further, $T^d[y, j]$
is defined as the maximum score of aligning a sequence ending with $(y_1 \ldots y_{k-1})$
to $(h_1 \ldots h_j)$ where $h_j$ is aligned to $y_{k-1}$. Initialize, for each y:



10 



S d \y,k-\] 



= — oo; 



(19) 




(20) 



15 



Loop over $j = k, \ldots, l$, and for each $y = (y_1 \ldots y_{k-1}) \in [V]$, recursively
update:



S d [y,j]=max 



(21) 



20 




r rf [ij-i]+a. 
s d [zj-i\+Vj 



'j 




(22) 



Finally, return: 
$MAXScore^d = T^d[\triangleleft^{k-1}, l]$    (23)



The complexity of this algorithm is still O(l[K]) and a linear space variant
can be obtained, as described in the previous embodiment. A sequence T*
attaining the maximal score is then found from the matrix $T^d$ as is known in the
art, for example, by saving trace-back pointers to follow the maximally scoring
path in a manner analogous to that described in the previous embodiment.

Third Embodiment: Substitutions, Deletions and Insertions. 

In this embodiment, a target sequence is determined when the target is
known to be obtained from the reference by substitutions, insertions and
deletions. The algorithm is an extension of the dynamic programs of the previous
embodiments.

Denote by $T_j$ the target prefix whose last nucleotide is aligned to $h_j$ in the
reference sequence. Further denote by $a_j$ (respectively $b_j$) the probability of
initiating (extending) an insertion in the target after $T_j$, and define
$\bar{a}_j = 1 - a_j$, $\bar{b}_j = 1 - b_j$.

Consider the weighted graph $(G, \omega)$. Define the $K \times K$ matrix W as follows:



20 




(24) 



$W^i[x, y]$ is thus the probability of moving from x to y along i edges. The
probability of an insertion of length i after $T_j$ is $a_j b_j^{i-1} \bar{b}_j$. Suppose that the prefix
$T_j$ ends with x. Then $a_j b_j^{i-1} \bar{b}_j W^i[x, y]$ is the probability of $T_{j+1}$ ending with y
and being i nucleotides longer than $T_j$. The matrix W' governing the stochastic
progression from $T_j$ to $T_{j+1}$ is calculated as follows:

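$W' = \bar{a}_j W + \sum_{i \ge 2} a_j b_j^{i-1} \bar{b}_j W^i$    (25)

$= \bar{a}_j W + a_j b_j \bar{b}_j W^2 \sum_{i \ge 2} b_j^{i-2} W^{i-2}$    (26)

$= \bar{a}_j W + a_j b_j \bar{b}_j W^2 (I - b_j W)^{-1}$    (27)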

A new weighted graph $(G', \omega')$ is now defined as follows. The vertex set of
G is also the vertex set of G'. The edge set E' of G' is the set of all pairs x, y with
$W'[x, y] > 0$. Each such edge e = (x, y) is associated with a weight
$\omega'(e) = \log W'[x, y]$.

The search for a high scoring candidate sequence may be performed by the
following algorithm referred to herein as "Algorithm C". In accordance with
Algorithm C, Algorithm B of the second embodiment is applied to $(G', \omega')$ instead
of $(G, \omega)$.

In contrast to G, degrees in G' are not bounded by 4. Therefore, computing
each dynamic program cell has complexity O(K) in the worst case, with the total
complexity of the algorithm being $O(l|E'|)$. Again, considering only the effective
size of the graph allows more efficient computation, taking $O(l|[E']|)$.

Fourth Embodiment: Substitutions, Deletions and Insertions. 

In this embodiment, homology between nucleotide sequences is described
by Hidden Markov Models (HMMs) using a set Q of Markov chain states with a
predefined set of allowed transitions between them. For each level (position
along the sequence) $j = 1, \ldots, l_Q$, Q includes three states: $M_j$ (match), $I_j$ (insert),
and $D_j$ (delete). $M_j$ and $D_j$ can be reached from the three (j-1)-th level states. $I_j$
can be reached from the three j-th level states (including a self-loop).
Transition probabilities are as described in previous sections, e.g.,
$a_j = \mathrm{Prob}(M_j \mapsto I_j)$. Additionally, each insert or match state, q, induces a vector of
emission probabilities $M^q$, where $M^q[i]$ is the probability that the target nucleotide
is i. We denote $L^q[i] = 0$ for $q = D_j$, $L^q[i] = \log M^q[i]$ otherwise. We write lpb(X) =
log Prob(X) for short.

The search for a high scoring candidate sequence may be performed by the
following algorithm referred to herein as "Algorithm D". In accordance with
Algorithm D, a three dimensional array S is defined, where for each
$q \in Q$, $y = (y_1 \ldots y_{k-1}) \in [V]$, $r = k, \ldots, L$, $S[q, y, r]$ is defined as the maximum score of
an r-long sequence ending with $(y_1 \ldots y_{k-1})$, whose alignment to the profile ends in
q. Thus, initialize:

^.>*" 1 »*- 1 ] = 0 (28) 

$S[q, y, k-1] = -\infty$ for other values of y, q    (29)

Loop over r = k, ... 1, and for each y = (jv-JVi) e [vj r < l„ , recursively 
update: 

$S[q, y, r] = L^q[y_{k-1}] + \max_{e=(z,y) \in E,\; q' \mapsto q} \{ S[q', z, r-1] + \mathrm{lpb}(q' \mapsto q) + \omega(e) \}$    (30)

Finally, return: 



MAX Score = m2x{s[q endX < h ~\ I 



(31) 



A sequence T* of maximal score is then found in a manner similar to that
described in the previous embodiments.

This algorithm requires $O(l_Q \cdot [K] \cdot L)$ time and space, where L is an upper
bound on the size of the target sequence. As with the previous embodiments, the
complexity of this algorithm can be reduced to $O(l_Q \cdot [K] \cdot L \log L)$ time and
$O(l_Q \cdot [K])$ memory. Furthermore, one can consider the dynamic program as filling an
$l_Q \times L$ matrix, with a [K]-long vector in each matrix cell. Since all values far from
the main diagonal of this matrix should be negligible, preferably only values
within a distance less than a predetermined constant R to the main diagonal are
calculated, reducing the complexity to $O(R(l_Q + L) \cdot [K] \cdot \log L)$ time and
$O(R(l_Q + L) \cdot [K])$ space.

Fifth Embodiment: Summation over all paths 

In this embodiment the graph nodes (HMM states and k-mers) that are
most likely to be visited at a certain position along the target sequence are
obtained. The "Forward-Backward" algorithm is used (see, e.g., Durbin et al.,
1998), providing the likelihood summed over all paths entering a node, instead of
the likelihood of the maximum path. The only change to the equations presented
thus far is that max operators are changed into log-sum-of-exponents. More
specifically, equations (12a), (12b), (15), (16), (20), (21), (29), and (30) are
re-written, respectively, as follows:



$S^u[y, j] = L^{(j)}[y_{k-1}, h_j] + \log \sum_{e=(z,y) \in E} \exp(S^u[z, j-1] + \omega(e))$    (12a')

$R^u[y, j] = \log \sum_{e=(y,z) \in E} \exp(R^u[z, j+1] + \omega(e) + L^{(j+1)}[z_{k-1}, h_{j+1}])$    (15')

$MAXScore^u = \log \sum_{y} \exp(S^u[y, j] + R^u[y, j])$    (16')

^l3>,7l-Mexp(^[3),y-l] + a>exp(^[3),7-l] + ^)) (20 ') 

Z«[y w ,/iJ+log 2>xp(^)) + 

/ / „r i f'* hE ( d r i ~ W (210 
+ lo^ex I (r1z, 7 -l]+a>ex^[z,7-l]+^.)) 

Sfoj?,/-] = Z, 9 + 2]exp(^',z,r-l] + /^(g'h->^) + ^)) (29 , } 

M4X Score= log£exp(s[^i <*~\ /]) (30') 

/ 

5 
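
The replacement of max by log-sum-of-exponents can be sketched as below. The helper names `logsumexp` and `combine` are hypothetical; the sketch only illustrates a numerically stable way of evaluating the operator, under the assumption that scores are kept in log space as in the preceding embodiments.

```python
import math

def logsumexp(values):
    """log(sum(exp(v) for v in values)), computed stably by factoring out the maximum."""
    values = list(values)
    top = max(values)
    if top == -math.inf:
        return -math.inf          # every term has probability zero
    return top + math.log(sum(math.exp(v - top) for v in values))

def combine(terms, sum_over_paths):
    """Inner operator of a recursion: max for the best path, log-sum-exp for all paths."""
    return logsumexp(terms) if sum_over_paths else max(terms)

best_path = combine([1.0, 3.0], sum_over_paths=False)
all_paths = combine([1.0, 3.0], sum_over_paths=True)
```

Factoring out the largest term avoids underflow when all path scores are very negative, which is the usual situation for long sequences.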

Sixth Embodiment: Enhancements 

In this embodiment, the exact likelihood (Equation 10a) is calculated for
several top-scoring candidates found using the approximated likelihood
(Equation 10b). These sequences are then re-ranked. This 2-phase
filtering is more discriminative than approximated scoring, while still tractable
using the formulae presented.

If the score of a dynamic programming cell is very low, that cell probably
does not participate in the maximum solution. This allows discarding such cells,
without risking loss of the true optimum. Computing time and space may thus be
saved.

The invention may be used for simultaneously re-sequencing several short
targets, instead of a single long sequence. This scenario arises when considering
many exons of a single gene. The invention may also be generalized to deal with
DNA chips that do not contain the set of all k-mers.

When the set of oligonucleotides on the microarray is not the set of all
k-mers, a graph is constructed whose vertices, instead of all the
(k-1)-mers, are all the prefixes and suffixes of oligonucleotides on the microarray.
Edges in this graph connect two vertices if there is a one-base suffix (prefix)
addition to one of them that makes the other its proper suffix (prefix). The
scoring mechanism remains the same. This also applies for oligonucleotides
containing "gaps" or "universal bases" (Preparata et al., 1999).
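
One literal reading of this graph construction is sketched below. It assumes that "prefixes and suffixes" means each probe with its last or first base removed, and it implements only the suffix form of the edge rule; both choices are assumptions made for illustration, and the function name is hypothetical.

```python
SIGMA = "ACGT"

def probe_graph(probes):
    """Build a graph from an arbitrary probe set.

    Vertices: each probe with its last base removed (prefix) or first base removed (suffix).
    Edge (u, v): appending one base to u gives a string that has v as a proper suffix.
    """
    vertices = set()
    for x in probes:
        vertices.add(x[:-1])
        vertices.add(x[1:])
    edges = set()
    for u in vertices:
        for c in SIGMA:
            extended = u + c
            for v in vertices:
                if len(v) < len(extended) and extended.endswith(v):
                    edges.add((u, v))
    return vertices, edges

vertices, edges = probe_graph(["ACG", "CGT"])
```

For probes of a single length k this rule reduces to the usual de Bruijn adjacency between (k-1)-mers; the quadratic scan over vertex pairs is kept only for brevity.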

The invention may also be used for sequencing polypeptides. Given a
polypeptide chain homologous to a target, and given a collection of probabilities
of occurrence of sub-chains along the target, our algorithms will find the optimal
candidate target sequence.

Example 

The invention was implemented and tested on simulated data. Nucleotide
substitutions were equiprobable and insertions and deletions were not allowed.
As a reference sequence, prefixes of the gene-rich human mitochondrial
sequence (Accession Number J01415) were used. For each reference sequence,
the following were generated:

1. A target sequence generated according to a prescribed probability q
of substitution, defining the matrix M as 1-q on the diagonal and q/3
elsewhere.

2. An 8-spectrum for the target, generated using the probabilistic
spectrum defined by $P_i(x) = 1 - p$ if $T(x) = i$, where p is a fixed probability.

All probabilistic parameters were constant, i.e., position/k-mer
independent. For each 8-spectrum and target sequence, candidate
sequences were scored using Eq. (10), and a candidate sequence of
maximal score was found.
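
The two generation steps above can be sketched as follows. The sketch assumes a binary hybridization signal that agrees with the true occurrence of each k-mer with probability 1 - p; the function names, the random generator argument and the short toy reference are hypothetical and do not reproduce the original simulation code.

```python
import random
from itertools import product

SIGMA = "ACGT"

def mutate(reference, q, rng):
    """Substitute each position independently with probability q, uniformly over the other three bases."""
    out = []
    for h in reference:
        if rng.random() < q:
            out.append(rng.choice([c for c in SIGMA if c != h]))
        else:
            out.append(h)
    return "".join(out)

def noisy_spectrum(target, k, p, rng):
    """Map every k-mer to an observed 0/1 signal; each true value is flipped with probability p."""
    present = {target[i:i + k] for i in range(len(target) - k + 1)}
    spectrum = {}
    for kmer in ("".join(t) for t in product(SIGMA, repeat=k)):
        truth = kmer in present
        spectrum[kmer] = truth if rng.random() >= p else not truth
    return spectrum

rng = random.Random(0)                       # fixed seed so that runs are repeatable
reference = "ACGTACGTAC"                     # toy reference, not the mitochondrial sequence
target = mutate(reference, 0.03, rng)
observed = noisy_spectrum(target, 3, 0.02, rng)
```

With k = 8 the spectrum has 65,536 entries, which is still small enough to enumerate directly in this way.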

The algorithm was implemented in C++ and executed on Linux and SGI
machines. Running times, on a Pentium 3, 600MHz machine, were roughly 0.12 l
log l seconds for an l-long reference sequence (ranging from roughly 7 minutes
for a 500bp-long sequence to 2.5 hours for 6Kb). Only the main memory was
used, with the application consuming at most 40Mb. The graph was not reduced
to its effective size. Such a reduction would have reduced both space and time
dramatically, at the expense of possibly missing the truly maximal scoring sequence.
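
As a rough check of the quoted running-time estimate, read here as 0.12 · l · log l seconds, the following sketch evaluates it for the two quoted lengths. The base of the logarithm is not stated in the text; base 2 is assumed here, and the function name is hypothetical.

```python
import math

def estimated_seconds(l, coefficient=0.12):
    """Running-time estimate coefficient * l * log2(l), with base 2 assumed."""
    return coefficient * l * math.log2(l)

minutes_500bp = estimated_seconds(500) / 60
hours_6kb = estimated_seconds(6000) / 3600
```

Under this assumption the estimate gives about 9 minutes for 500 bp and about 2.5 hours for 6 Kb, which is of the same order as the quoted range.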

The performance of the algorithm was quantified by the following figures
of merit:

1. Full success rate - The fraction of runs for which the target sequence
was perfectly reconstructed.

2. ε-success rate - The fraction of runs for which the target sequence
of length l was reconstructed with fewer than ε·l nucleotide errors.

3. Average sequencing error - The fraction of nucleotide errors.
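
The three figures of merit may be computed as in the sketch below. It assumes that each run yields a target and a reconstruction of equal length, and that the average sequencing error is the mean of the per-run error fractions; the text does not say whether errors are averaged per run or pooled, so this is an assumption. The function name and the toy runs are hypothetical.

```python
def figures_of_merit(runs, eps):
    """runs: list of (target, reconstruction) pairs of equal length; eps: error tolerance."""
    full = within = 0
    error_fractions = []
    for target, reconstruction in runs:
        errors = sum(a != b for a, b in zip(target, reconstruction))
        full += errors == 0                      # perfectly reconstructed
        within += errors < eps * len(target)     # fewer than eps * l errors
        error_fractions.append(errors / len(target))
    n = len(runs)
    return {
        "full_success": full / n,
        "eps_success": within / n,
        "avg_error": sum(error_fractions) / n,
    }

# Toy runs (hypothetical): 0, 1 and 3 errors in sequences of length 1000.
t = "ACGT" * 250
runs = [(t, t), (t, "C" + t[1:]), (t, "CAT" + t[3:])]
merit = figures_of_merit(runs, eps=2e-3)
```

In the tables the rates and the average error are reported as percentages, so the fractions returned here would be multiplied by 100.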

Table 1 presents results for a scenario of distinct, but closely related
sequences, e.g., orthologous genes in a pair of primates. We assume perfect
hybridization data with 97% sequence similarity (that is q=0.03). The results
show that sequences of length up to 2000 can be reconstructed almost perfectly.
The non-monotonicity of the figures of merit with respect to the target length is
probably due to sequence contents.

Table 2 presents results for a scenario of SNP-genotyping. The rate of
SNPs is assumed to be 1:700 (Wang et al. 1998), and p=2% was used. The results
show that a high success rate is achievable even in the presence of spectrum
errors.



Table 1

Length   # runs   % full     % ε-success   % ε-success    % avg.
                  success    ε = 10^-3     ε = 2·10^-3    error
500      10       100        100           100            0.000
1000     10       100        100           100            0.000
1500     10       100        100           100            0.000
2000     17       94         94            94             0.003
2500     13       46         53            69             0.295
3000     14       71         78            78             0.488
3500     5        0          20            20             4.091
4000     13       76         84            84             2.173
4500     11       9          18            45             0.091
5000     15       0          13            53             4.149
5500     7        14         28            71             0.119

Table 1: Results on simulated data, for different sequence lengths, assuming
97% sequence similarity between the target and the reference, and perfect
hybridization data.

Table 2

Length   # runs   % full     % ε-success   % ε-success    % avg.
                  success    ε = 10^-3     ε = 2·10^-3    error
250      10       100        100           100            0.000
500      10       100        100           100            0.000
750      10       90         90            100            0.013
1000     10       90         90            90             0.010
1250     10       90         100           100            0.032
1500     12       91         100           100            0.033
1750     10       60         80            80             0.109
2000     10       60         90            90             4.965
2500     10       0          80            100            10.312
3000     10       30         70            90             0.230

Table 2: Results on simulated data, for different sequence lengths, assuming
p = 2% error in the hybridization data, with 1:700 sequence difference.



It will also be understood that the system according to the invention may
be a suitably programmed computer. Likewise, the invention contemplates a
computer program being readable by a computer for executing the method of the
invention. The invention further contemplates a machine-readable memory
tangibly embodying a program of instructions executable by the machine for
executing the method of the invention.

CLAIMS:

1. A method for obtaining a candidate nucleotide sequence, the candidate
nucleotide sequence being indicative of a sequence of a target polynucleotide
molecule T, T producing a hybridization signal I(x) upon incubating T with a
polynucleotide x for each polynucleotide x in a set E of polynucleotides, the
method comprising the steps of:

(a) for each polynucleotide x in the set E of polynucleotides, obtaining a
probability P0(x) of the hybridization signal I(x) when the sequence x is not
complementary to a subsequence of T and a probability P1(x) of the hybridization
signal when the sequence x is complementary to a subsequence of T; so as to
obtain a probabilistic spectrum (PS) of T;

(b) assigning a score to each of a plurality of candidate nucleotide
sequences, the score being based upon the probabilistic spectrum and upon at least
one reference nucleotide sequence H; and

(c) selecting one or more candidate nucleotide sequences having an
essentially maximal score.

2. The method according to Claim 1, wherein the polynucleotides x in the set
E are immobilized on a surface.

3. The method according to Claim 1 or 2, wherein the set E is a set of k-mers.

4. The method according to Claim 3 wherein E is the set of all k-mers formed
from nucleotides from a predetermined set of nucleotides.

5. The method of Claim 4 wherein the predetermined set of nucleotides is
selected from the group consisting of

(a) adenine, guanine, cytosine, and thymine; and

(b) adenine, guanine, cytosine, and uracil.

6. The method according to any one of the previous claims, wherein the score 
of a candidate nucleotide sequence f is based upon Z e (f) where 
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wherein f(x) = 0 if the sequence of x is not complementary to a subsequence of 
f and f (x) = 1 if the sequence of x is complementary to a subsequence of f . 

7. The method according to any one of Claims 1 to 6, wherein the score of a 

candidate sequence f is based upon Z/(t) where logZ e (f )= J®(e,), wherein f 

5 contains polynucleotides e 0 , . . .e m and co{e l ) = log ^ jf' } . 

8. The method according to any one of the previous claims, wherein T and H 
have a common length. 

9. The method according to Claim 8, wherein the score of a candidate
sequence $\hat{T}$ is based upon $D^u(\hat{T})$ where $D^u(\hat{T}) = \prod_{j} M^{(j)}[\hat{t}_j, h_j]$, wherein $M^{(j)}[\hat{t}_j, h_j]$
is a probability of a nucleotide $t_j$ in position j of T being replaced with nucleotide $h_j$
in position j of H.

10. The method according to Claim 9, wherein the score of a candidate 
nucleotide sequence f is Scored (f ), or Score u 2 (f ) where 

Score\(f)= lo g r (f)+logD"(f) and Score 11 : 2 (f)= logr(f)+logD*(f). 
15 11. The method according to Claim 10 wherein the polynucleotides in the set E 
are k-mers and the step of selecting a candidate nucleotide sequence having an 
essentially maximal score comprises the steps of 

k-l r 

(a) For each (k- 1 )-mer y calculating S " \y,k - 1] = L J [y h } 

7=1 

(b) for each integer j = k, ... 1, 

20 (ba) for each polynucleotide sequence (yi, . . . yk-i) 

(baa) calculating 

S" \y, j] = L U) \y^,hj]+ max {s"[z, w{e 

J e={z,y)eE 

wherein L a) [y, hj]= log M G) [y, hj]. 

(bab) selecting a (k- 1 )-mer P[ 5? j] 
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satisfying 

(c) selecting a (k-l)-mer Z 1 having a score essentially equal to 
5 (d) for j=k-l,...,l-l; recursively calculating (k-l)-mers Z J where 

k 1 k-1 k-1 k 

(e) selecting candidate target sequence <z i, z 2,---z k-i, z k-i, 
z k+ Vi,. • .z' k -i>, where zWi . .z j k _i> 

12. The method according to Claim 9, wherein the polynucleotides in the set E
are k-mers, and the step of selecting a candidate nucleotide sequence having an
essentially maximal score comprises the steps of:

(a) If the length l of the target is greater than a predetermined constant,
setting $m = \frac{l + k - 1}{2}$;

(b) For each $j = k-1, \ldots, m$, computing $S^u[y, j]$ according to Claim 10
for all y;

(c) For each $j = l, l-1, \ldots, m$, computing $R^u[y, j]$ according to
equations (14) and (15) for all y;

(d) Selecting $y^m = \arg\max_{y} \{ S^u[y, m] + R^u[y, m] \}$;

(e) Computing the optimal sequence aligned to $(h_1 \ldots h_m)$ ending with $y^m$,
and the optimal sequence aligned to $(h_m \ldots h_l)$ beginning with $y^m$.

13. The method according to any one of Claims 1 to 7, wherein H and T have
lengths such that the length of T is less than the length of H.

14. The method according to Claim 13, wherein the step of assigning a score to
each of a plurality of candidate nucleotide sequences and the step of selecting the
candidate target sequence are performed according to Algorithm B.

15. The method according to any one of Claims 1 to 7, wherein H and T have
arbitrary lengths.

16. The method according to Claim 15, wherein the step of assigning a score to
each of a plurality of candidate nucleotide sequences and the step of selecting the
candidate target sequence are performed according to Algorithm C.

17. The method according to Claim 15, wherein the step of assigning a score to
each of a plurality of candidate nucleotide sequences and the step of selecting the
candidate target sequence are performed according to Algorithm D.

18. The method according to Claim 17 wherein a Hidden Markov Model is used
instead of a reference sequence.

19. The method according to any one of Claims 1 to 11, wherein the algebraic
equation (12a') replaces the algebraic equation (12a), the algebraic equation (12b')
replaces the algebraic equation (12b), the algebraic equation (15') replaces the
algebraic equation (15), and the algebraic equation (16') replaces the algebraic
equation (16).

20. The method according to any one of Claims 13, 14, or 19, wherein the
algebraic equation (20') replaces the algebraic equation (20), and the algebraic
equation (21') replaces the algebraic equation (21).

21. The method according to any one of Claims 15, 17, or 18, wherein the
algebraic equation (29') replaces the algebraic equation (29), and the algebraic
equation (30') replaces the algebraic equation (30).

22. The method according to any one of the previous claims wherein the target 
comprises two or more polynucleotide molecules. 

23. The method according to any one of Claims 1 to 22, further comprising computing
the exact score $L^e(\hat{T})$ for several candidate sequences chosen according to the value
of the approximated score $L^a(\hat{T})$.

24. The method according to any one of the previous claims further comprising 
a step of deleting candidate sequences having likelihood below a predetermined 
value. 



25. The method according to any one of Claims 1 to 5, 8 to 24, wherein the 
score of a candidate nucleotide sequence f is based upon I/(t ) where 

^)=n%-,(*), 

xeA 

wherein $\hat{T}(x) = r$ if the sequence of x is complementary to exactly r subsequences
of $\hat{T}$.

26. The method according to any one of the previous claims, wherein the set E
of polynucleotides does not include all the polynucleotides of a specific length.

27. The method according to any one of the previous claims, wherein the set E
of polynucleotides includes polynucleotides of different lengths.

28. The method according to any one of the previous claims for use in a task
selected from the group comprising:

(a) Detecting or genotyping of Single Nucleotide Polymorphisms;

(b) Detecting or genotyping of genetic syndromes or disorders;

(c) Detecting or genotyping somatic mutations; and

(d) Sequencing a polynucleotide whose function is related to the
function of the reference polynucleotide.

29. The method according to any one of the previous claims, wherein 
polynucleotides contain gaps, or universal bases. 

30. The method according to any one of the previous claims, wherein 
polypeptides are sequenced instead of polynucleotides. 

31. A program storage device readable by machine, tangibly embodying a
program of instructions executable by the machine to perform method steps for
obtaining a candidate nucleotide sequence, the candidate nucleotide sequence being
indicative of a sequence of a target polynucleotide molecule T, T producing a
hybridization signal I(x) upon incubating T with a polynucleotide x for each
polynucleotide x in a set E of polynucleotides, the method comprising the steps of:

(a) for each polynucleotide x in the set E of polynucleotides, obtaining a
probability P0(x) of I(x) when the sequence x is not complementary to a
subsequence of T and a probability P1(x) of I(x) when the sequence x is
complementary to a subsequence of T; so as to obtain a probabilistic spectrum (PS)
of T;

(b) assigning a score to each of a plurality of candidate nucleotide
sequences, the score being based upon the probabilistic spectrum and upon at least
one reference nucleotide sequence H; and

(c) selecting a candidate nucleotide sequence having an essentially
maximal score.

32. A computer program product comprising a computer useable medium 
having computer readable program code embodied therein for obtaining a 
candidate nucleotide sequence, the candidate nucleotide sequence being indicative 
of a sequence of a target polynucleotide molecule T, T producing a hybridization 
signal I(x) upon incubating T with a polynucleotide x for each polynucleotide x 
in a set E of polynucleotides, the computer program product comprising: 

(a) for each polynucleotide x in the set E of polynucleotides, computer
readable program code for causing the computer to obtain a probability P0(x) of
I(x) when the sequence x is not complementary to a subsequence of T and a probability
P1(x) of I(x) when the sequence x is complementary to a subsequence of T;

(b) computer readable program code for causing the computer to assign a 
score to each of a plurality of candidate nucleotide sequences, the score being 
based upon the probabilistic spectrum and upon at least one reference nucleotide 
sequence H; and 

(c) computer readable program code for causing the computer to select a 
candidate nucleotide sequence having an essentially maximal score. 

ABSTRACT

A method for obtaining a nucleotide sequence that is indicative of the sequence of a
target polynucleotide molecule T. The method makes use of hybridization data
obtained by incubating T with nucleotide probes. A score is assigned to each of a
plurality of candidate nucleotide sequences based upon the hybridization data and
upon at least one reference nucleotide sequence. A candidate nucleotide sequence
is then selected having an essentially maximal score.